Predicting location based on WiFi fingerprints
Executive summary
In this study I describe the main issues with the provided datasets and report results for 2 machine learning algorithms (KNN and Random Forest).
For the chosen preprocessed dataset, the Random Forest model produces predictions that are closer to the real values. Nevertheless, some errors in the Latitude and Longitude predictions are still quite high; reducing them requires further data analysis and another preprocessing iteration.
Goals
The main goals of the current study are:
To investigate the feasibility of using “wifi fingerprinting” to determine a person’s location in indoor spaces.
To evaluate multiple machine learning models to see which produces the best result.
To provide recommendations, based on my own research on indoor locationing, on how the results might be improved.
Explore the Data
Signal Strength Intensity
The strongest signal in the validation set is -34 dBm, while the training set contains signals above -30 dBm.
Moreover, several WAPs send signals above -30 dBm at the same time in the same location. This is impossible in real-life conditions (for more information see https://www.metageek.com/training/resources/understanding-rssi.html).
Another anomaly is the set of signals recorded by User ID 6 with phone 19 in Building 3, Floors 4 and 5. All of them are above -30 dBm, which is unrealistic (green color on the chart below).
At the same time, removing the observations provided by User ID 6 would significantly reduce the number of signals for Building 3, 5th floor.
Therefore I have kept these observations.
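The check behind this observation can be sketched as follows. This is a minimal Python illustration, not the code used in the study: the row layout (a `user` key plus an `rssi` dictionary), the function name and the sample values are all assumptions made up for the example. Only the -30 dBm limit and the value 100 for "no signal" come from this report.

```python
from collections import Counter

NO_SIGNAL = 100       # placeholder for "WAP not detected", as used in this report
MAX_PLAUSIBLE = -30   # dBm; stronger readings are treated as anomalous


def count_implausible_by_user(rows):
    """Count readings stronger than MAX_PLAUSIBLE for each user ID.

    Each row is a dict with a 'user' key and an 'rssi' dict (WAP name -> dBm).
    This layout is hypothetical and chosen only for the example.
    """
    counts = Counter()
    for row in rows:
        for value in row["rssi"].values():
            if value != NO_SIGNAL and value > MAX_PLAUSIBLE:
                counts[row["user"]] += 1
    return counts


# Made-up sample rows, for illustration only.
sample = [
    {"user": 6, "rssi": {"WAP_A": -12, "WAP_B": -25, "WAP_C": 100}},
    {"user": 6, "rssi": {"WAP_A": -28, "WAP_B": -70, "WAP_C": 100}},
    {"user": 2, "rssi": {"WAP_A": -55, "WAP_B": 100, "WAP_C": -81}},
]
implausible = count_implausible_by_user(sample)
print(implausible.most_common())
```

Sorting the counts per user makes it easy to see whether the anomalous readings are spread over all users or concentrated in one of them.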
Toggle User ID 6 on the chart below to see how the coverage on the 5th floor of the 3rd building changes.
Note: the chart below uses the dataset after the 1st preprocessing.
I have also observed strange signal patterns from phone IDs 11 and 17, which are visible on the signal frequency chart for Building 3, Floors 2 and 5.
This point requires additional investigation. No specific action was taken on these data during the current model training.
WAPs with the missing signal
Some WAPs do not send any signal at all, while others send signals to different buildings and even to different floors.
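Both groups can be found with one pass over the fingerprints. The sketch below is a Python illustration under assumptions: the row layout, the function name `summarise_waps` and the sample rows are hypothetical, and 100 is taken as the "no signal" value used in this report.

```python
NO_SIGNAL = 100  # placeholder for "WAP not detected"


def summarise_waps(rows, wap_names):
    """Return (silent WAPs, WAPs detected in more than one building)."""
    buildings_seen = {wap: set() for wap in wap_names}
    for row in rows:
        for wap in wap_names:
            if row["rssi"].get(wap, NO_SIGNAL) != NO_SIGNAL:
                buildings_seen[wap].add(row["building"])
    silent = sorted(w for w, seen in buildings_seen.items() if not seen)
    multi = sorted(w for w, seen in buildings_seen.items() if len(seen) > 1)
    return silent, multi


# Made-up sample rows, for illustration only.
sample = [
    {"building": 1, "rssi": {"WAP_A": -60, "WAP_B": 100, "WAP_C": -70}},
    {"building": 2, "rssi": {"WAP_A": 100, "WAP_B": 100, "WAP_C": -75}},
]
silent_waps, multi_building_waps = summarise_waps(sample, ["WAP_A", "WAP_B", "WAP_C"])
print(silent_waps, multi_building_waps)
```

The same loop can track floors instead of buildings by replacing the `building` key with a floor key.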
General information about WAP signals
WAPs location
Some WAPs are also located in different buildings in the training and validation sets:
Relocated WAPs. Training and Validation datasets
This anomaly was detected after I had done the first preprocessing and trained the models. Almost all of those WAPs emitted signals below -80 dBm, so they were removed automatically during the first preprocessing, except for WAP216, which appeared in the training set.
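One possible way to flag such WAPs is to compare the building in which each WAP is detected most often in the two datasets. The Python sketch below assumes this "dominant building" heuristic, which is not necessarily how the relocated WAPs were found in this study; the row layout, function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict

NO_SIGNAL = 100  # placeholder for "WAP not detected"


def dominant_building(rows):
    """Map each detected WAP to the building where it is detected most often."""
    hits = defaultdict(Counter)
    for row in rows:
        for wap, value in row["rssi"].items():
            if value != NO_SIGNAL:
                hits[wap][row["building"]] += 1
    return {wap: counts.most_common(1)[0][0] for wap, counts in hits.items()}


def relocated_waps(train_rows, valid_rows):
    """WAPs whose dominant building differs between the two datasets."""
    train_home = dominant_building(train_rows)
    valid_home = dominant_building(valid_rows)
    shared = train_home.keys() & valid_home.keys()
    return sorted(w for w in shared if train_home[w] != valid_home[w])


# Made-up sample rows, for illustration only.
train = [
    {"building": 1, "rssi": {"WAP_A": -70, "WAP_B": -60}},
    {"building": 1, "rssi": {"WAP_A": -72, "WAP_B": 100}},
    {"building": 2, "rssi": {"WAP_A": 100, "WAP_B": 100}},
]
valid = [
    {"building": 3, "rssi": {"WAP_A": -65, "WAP_B": 100}},
    {"building": 1, "rssi": {"WAP_A": 100, "WAP_B": -62}},
]
moved = relocated_waps(train, valid)
print(moved)
```

WAPs that are detected in only one of the two datasets are ignored here, because there is nothing to compare them with.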
Coverage
First of all, the training set has blind spots: areas for which no WiFi fingerprints were provided.
Also, as the charts below show, coverage (the number of WAPs sending a good signal) is not evenly distributed. Locations such as the 5th floor of the 3rd building and some spots in the 2nd building do not have enough signals, which will negatively affect the predictions.
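Coverage in this sense can be expressed as a number per floor. The Python sketch below averages the count of WAPs with a usable signal over all fingerprints of each building and floor. It assumes that "good signal" means -80 dBm or stronger, which is the cutoff used later in the preprocessing; the row layout, function name and sample rows are hypothetical.

```python
from collections import defaultdict

NO_SIGNAL = 100  # placeholder for "WAP not detected"


def mean_good_waps_per_floor(rows, threshold=-80):
    """Average number of WAPs at or above `threshold` dBm per (building, floor)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [good WAP count, fingerprint count]
    for row in rows:
        good = sum(
            1 for v in row["rssi"].values() if v != NO_SIGNAL and v >= threshold
        )
        entry = totals[(row["building"], row["floor"])]
        entry[0] += good
        entry[1] += 1
    return {key: good / n for key, (good, n) in totals.items()}


# Made-up sample rows, for illustration only.
sample = [
    {"building": 3, "floor": 5, "rssi": {"WAP_A": -85, "WAP_B": 100, "WAP_C": -70}},
    {"building": 3, "floor": 5, "rssi": {"WAP_A": 100, "WAP_B": 100, "WAP_C": -60}},
    {"building": 1, "floor": 1, "rssi": {"WAP_A": -50, "WAP_B": -60, "WAP_C": -79}},
]
coverage = mean_good_waps_per_floor(sample)
print(coverage)
```

Floors with a low average are the ones where predictions are expected to suffer most.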
Preprocessing
I have tried different preprocessed datasets and would like to focus on 2 of them.
First preprocessing
For the first preprocessing I proceeded as follows:
I combined the training and validation sets. As mentioned before, the training set did not contain all the necessary WiFi fingerprints, so I used the validation dataset to cover the empty spots.
Replaced all values higher than -30 dBm with -30 dBm. See Signal Strength Intensity for more details.
Replaced all values lower than -80 dBm with 100, the value that marks a WAP with no detected signal. -80 dBm is the minimum signal strength for basic connectivity; packet delivery may be unreliable.
Removed fully duplicated rows.
Removed rows and columns that contain ONLY the value 100 (WAPs that do not emit any signal).
Discretized Longitude and Latitude into 10 bins in order to use the new categorical variables for stratified sampling.
To reduce calculation time, I took a stratified sample of the cleaned dataset and divided it into training and testing sets.
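The steps above can be sketched in a few functions. This is a dependency-free Python illustration, not the code used in the study. The assumptions are: rows are dictionaries with one key per WAP plus `longitude` and `latitude`; the bins are equal-width; and the sample is stratified on the pair of longitude and latitude bins. All function names and sample values are hypothetical.

```python
import random
from collections import defaultdict

NO_SIGNAL = 100     # placeholder for "no detected signal"
STRONG_CAP = -30    # dBm; stronger readings are capped at this value
WEAK_CUTOFF = -80   # dBm; weaker readings are treated as no signal


def clean_value(value):
    """Apply the two replacement rules to a single reading."""
    if value == NO_SIGNAL:
        return NO_SIGNAL
    if value > STRONG_CAP:
        return STRONG_CAP
    if value < WEAK_CUTOFF:
        return NO_SIGNAL
    return value


def preprocess(rows, waps):
    """Clean readings, drop duplicate rows, then drop silent WAPs and empty rows."""
    cleaned = [{**row, **{w: clean_value(row[w]) for w in waps}} for row in rows]
    unique, seen = [], set()
    for row in cleaned:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    live = [w for w in waps if any(r[w] != NO_SIGNAL for r in unique)]
    silent = set(waps) - set(live)
    kept = [
        {k: v for k, v in row.items() if k not in silent}
        for row in unique
        if any(row[w] != NO_SIGNAL for w in live)
    ]
    return kept, live


def bin_index(value, low, high, bins=10):
    """Equal-width bin index in 0..bins-1 (equal width is an assumption)."""
    if high == low:
        return 0
    return min(int((value - low) / (high - low) * bins), bins - 1)


def stratified_sample(rows, fraction, seed=0):
    """Sample about `fraction` of the rows from every (longitude, latitude) cell."""
    lons = [r["longitude"] for r in rows]
    lats = [r["latitude"] for r in rows]
    lon_lo, lon_hi, lat_lo, lat_hi = min(lons), max(lons), min(lats), max(lats)
    cells = defaultdict(list)
    for r in rows:
        cell = (bin_index(r["longitude"], lon_lo, lon_hi),
                bin_index(r["latitude"], lat_lo, lat_hi))
        cells[cell].append(r)
    rng = random.Random(seed)
    sample = []
    for members in cells.values():
        size = max(1, round(len(members) * fraction))  # keep at least one row per cell
        sample.extend(rng.sample(members, size))
    return sample


# Made-up sample rows, for illustration only.
raw = [
    {"WAP_A": -20, "WAP_B": -95, "WAP_C": 100, "longitude": 0.0, "latitude": 0.0},
    {"WAP_A": -30, "WAP_B": -85, "WAP_C": 100, "longitude": 0.0, "latitude": 0.0},
    {"WAP_A": -60, "WAP_B": -90, "WAP_C": 100, "longitude": 5.0, "latitude": 5.0},
    {"WAP_A": -90, "WAP_B": 100, "WAP_C": 100, "longitude": 9.0, "latitude": 9.0},
]
kept_rows, live_waps = preprocess(raw, ["WAP_A", "WAP_B", "WAP_C"])
subset = stratified_sample(kept_rows, fraction=0.5)
```

In this toy input the second row becomes a duplicate of the first after cleaning, the last row loses its only signal, and WAP_B and WAP_C end up silent, so two rows and one WAP remain. The split into training and testing sets is left out of the sketch.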
Second preprocessing
For the second dataset, in addition to the previous steps, I decided to:
- Remove signals above -30 dBm instead of replacing them with -30 dBm.
- Remove all WAPs that were relocated (Nr. 55, 56, 195, 196, 216).
- Remove WAPs that send signals to different buildings.
After plotting the cleaned data from the second preprocessing, I realized that some areas, for example the 2nd building, had too few signals. So I decided to drop the 3rd step and keep all WAPs that send signals to different buildings.
The chart below shows the signals from those WAPs that send signals to only one building.
Note that the models trained on the 2nd preprocessing did not produce better results, so the rest of this description focuses on the models trained on the 1st preprocessing.
Training models & Error assessment
I trained models with 2 algorithms: KNN and Random Forest.
I also tried the knn3 and kknn (kernel = "triangular") algorithms, but their metrics for predicting Building were worse than those of KNN and Random Forest, so I decided not to proceed with them and focused only on KNN and Random Forest.
I decided to use a cascade model for predicting each parameter:
- First I predicted the Building;
- Then, based on the predicted Building, I divided the training and testing sets into separate subsets for each building;
- Finally, for each subset I predicted the Floor, Latitude and Longitude.
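The cascade can be illustrated with a very small example. The Python sketch below uses a plain k-nearest-neighbours rule for every stage, only to keep the example short and dependency-free; it is not the implementation used in the study, and the feature vectors, function names and sample values are made up.

```python
import math
from collections import Counter


def nearest(train, query, k):
    """The k training rows closest to the query fingerprint (Euclidean distance)."""
    return sorted(train, key=lambda row: math.dist(row["x"], query))[:k]


def predict_label(train, query, target, k=3):
    """Majority vote among the k nearest rows (classification)."""
    votes = Counter(row[target] for row in nearest(train, query, k))
    return votes.most_common(1)[0][0]


def predict_value(train, query, target, k=3):
    """Mean of the target over the k nearest rows (regression)."""
    neighbours = nearest(train, query, k)
    return sum(row[target] for row in neighbours) / len(neighbours)


def cascade_predict(train, query, k=3):
    """Predict Building first, then the rest from that building's rows only."""
    building = predict_label(train, query, "building", k)
    subset = [row for row in train if row["building"] == building]
    return {
        "building": building,
        "floor": predict_label(subset, query, "floor", k),
        "latitude": predict_value(subset, query, "latitude", k),
        "longitude": predict_value(subset, query, "longitude", k),
    }


# Made-up fingerprints: 'x' holds the readings of two hypothetical WAPs in dBm.
train = [
    {"x": (-40, -90), "building": 1, "floor": 0, "latitude": 10.0, "longitude": 100.0},
    {"x": (-45, -88), "building": 1, "floor": 0, "latitude": 12.0, "longitude": 102.0},
    {"x": (-60, -85), "building": 1, "floor": 1, "latitude": 11.0, "longitude": 101.0},
    {"x": (-90, -40), "building": 2, "floor": 2, "latitude": 50.0, "longitude": 200.0},
    {"x": (-88, -45), "building": 2, "floor": 2, "latitude": 52.0, "longitude": 202.0},
    {"x": (-85, -60), "building": 2, "floor": 3, "latitude": 51.0, "longitude": 201.0},
]
result = cascade_predict(train, query=(-42, -89))
print(result)
```

The point of the design is that an error in the first stage propagates: if the Building is wrong, the Floor, Latitude and Longitude are predicted from the wrong subset, which is why the Building model has to be chosen first.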
After predicting Building and analysing the error metrics provided below, I chose to proceed with the Building values predicted by the Random Forest algorithm.
The summary tables below give Accuracy, Kappa and the confusion matrix for the categorical variables Building and Floor.
Error metrics for classification algorithms - 1st preprocessing
Error metrics for classification algorithms - 2nd preprocessing
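For reference, Accuracy and Cohen's Kappa can be computed from the actual and predicted labels as in the Python sketch below. Kappa is (observed agreement - expected agreement) / (1 - expected agreement), where the expected agreement is what the label frequencies alone would produce by chance. The labels in the example are made up and are not results of this study.

```python
from collections import Counter


def accuracy_and_kappa(actual, predicted):
    """Return (accuracy, Cohen's kappa) for two equally long label sequences."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: sum over classes of P(actual = c) * P(predicted = c).
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    if expected == 1:
        return observed, 1.0  # only one class present on both sides
    return observed, (observed - expected) / (1 - expected)


# Made-up labels, for illustration only.
actual = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
predicted = [1, 1, 2, 2, 2, 2, 3, 3, 3, 1]
acc, kappa = accuracy_and_kappa(actual, predicted)
print(round(acc, 3), round(kappa, 3))
```

Kappa is lower than Accuracy whenever some of the agreement could have happened by chance, which makes it the more conservative of the two metrics when classes are unbalanced.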
Summary for error metrics for Latitude and Longitude:
As the residuals plot shows, each building has some errors. The highest errors are for the Longitude of the 2nd Building, so further investigation should start with these errors.
The plots below show correct and wrong predictions for Building and Floor.
The histogram below shows the number of incorrectly classified Buildings and Floors by PhoneID. It should be mentioned that only 2 phone IDs, 13 and 14, appear in both the training and validation sets.
The plot below shows the distribution of absolute errors for the predicted Latitude and Longitude for both models for Building 1:
As we can see, both models have some sporadic high errors, but Random Forest gives smaller errors for all variables. So I have chosen Random Forest as the final model for the current subset.
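A comparison of this kind can be summarised with a few numbers per model. The Python sketch below computes the mean absolute error, the root mean squared error and the largest absolute error for two sets of predictions. MAE and RMSE are assumed here as typical regression metrics; all values are made up and are not results of this study.

```python
import math


def error_summary(actual, predicted):
    """Mean absolute error, root mean squared error and largest absolute error."""
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    n = len(abs_errors)
    return {
        "mae": sum(abs_errors) / n,
        "rmse": math.sqrt(sum(e * e for e in abs_errors) / n),
        "max": max(abs_errors),
    }


# Made-up coordinates and predictions, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 19.0, 36.0, 40.0]
model_b = [11.0, 20.0, 32.0, 39.0]
summary_a = error_summary(actual, model_a)
summary_b = error_summary(actual, model_b)
print(summary_a)
print(summary_b)
```

RMSE reacts more strongly than MAE to sporadic high errors, so reporting both helps to separate a model that is slightly off everywhere from one that is mostly accurate with a few large misses.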
Next steps
For the next steps I would like to:
- Find a better way to discretize Longitude and Latitude for the stratified sample.
- Investigate further the WAPs that send signals to different buildings.
- Check the errors in the context of phone models.
- Check the error segments in detail and re-process the data according to new insights.
Recommendations
It is recommended to collect WiFi fingerprints for the empty spots in the training set, such as the 5th floor of the 3rd building and some spots in the second and first buildings.
See the plot for the training set only in the Coverage chapter.
For better results it is also recommended to take a deep dive into the points mentioned in the Next steps chapter.